
CSCI-UA 9473 Introduction to Machine Learning

Fall 2022

Assignment 2: Fast Iterative Hard Thresholding and Convolutional nets

Given date: Wednesday October 19
Due date: Friday November 4

Total: 30pts

Additional readings (To go further):

Ian Goodfellow and Yoshua Bengio and Aaron Courville, Deep Learning
Amir Beck, Marc Teboulle, A Fast Iterative Shrinkage-Thresholding Algorithm for Linear Inverse Problems

The assignment is divided into two parts. In the first part, we will study the LASSO and solve it with iterative shrinkage-thresholding (ISTA) and its accelerated variant (FISTA). In the second part, we will go back to neural networks: you will be asked to build and train a convolutional neural network for image classification.

Part I: Solving LASSO

Question 1 Least Absolute Shrinkage and Selection Operator¶

(13pts)

A model can be learned very efficiently under the OLS loss, either by gradient descent or directly through the normal equations. The same is true for ridge regression. For LASSO, however, the non-differentiability of the absolute value at $0$ makes learning trickier.

One approach, known as ISTA (Iterative Shrinkage-Thresholding Algorithm), consists in combining traditional gradient descent steps with a soft-thresholding (shrinkage) step that accounts for the $\ell_1$ penalty. Concretely, consider the LASSO objective

\begin{align} \ell(\boldsymbol \beta) = \|\boldsymbol X\boldsymbol \beta - \boldsymbol t\|^2_2 + \lambda \|\boldsymbol \beta\|_1 \end{align}

where the data has been centered so that $\beta_0 = 0$, i.e.

\begin{align} \mathbf{x}^{(i)} \leftarrow \mathbf{x}^{(i)}- \frac{1}{N}\sum_{j=1}^{N} \mathbf{x}^{(j)}\\ t^{(i)} \leftarrow t^{(i)} - \frac{1}{N}\sum_{j=1}^N t^{(j)} \end{align}
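The centering step can be sketched in NumPy as follows. This is only a minimal illustration, assuming that the feature vectors are stored as the rows of a 2D array; the helper name `center_data` and the small arrays are hypothetical and not part of the assignment template.

```python
import numpy as np

def center_data(X, t):
    """Subtract the column means of X and the mean of t (hypothetical helper)."""
    X = np.asarray(X, dtype=float)
    t = np.asarray(t, dtype=float)
    X_centered = X - X.mean(axis=0)   # each feature column now has zero mean
    t_centered = t - t.mean()         # the targets now have zero mean
    return X_centered, t_centered

# Toy data: 3 samples, 2 features
X = np.array([[1.0, 2.0], [3.0, 6.0], [5.0, 10.0]])
t = np.array([1.0, 2.0, 6.0])
Xc, tc = center_data(X, t)
```

After this step, every column of `Xc` and the vector `tc` average to zero, which is what allows the intercept $\beta_0$ to be dropped from the model.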

The ISTA update takes the form

\begin{align} \boldsymbol \beta^{k+1} \leftarrow \mathcal{T}_{\lambda \eta} \left(\boldsymbol \beta^{k} - 2\eta \mathbf{X}^T(\mathbf{X}\boldsymbol \beta^{k} - \mathbf{t})\right) \end{align}

where $\mathcal{T}_{\lambda \eta}$ is the soft-thresholding operator, defined component-wise as

\begin{align} \mathcal{T}_{\lambda \eta}(\boldsymbol{\beta})_i = (|\beta_i| - \lambda \eta)_+ \, \text{sign}(\beta_i) \end{align}

In the equations above, $\eta$ is an appropriate step size and $(x)_+ = \max(x, 0)$.
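To make the thresholding operator concrete, here is a small NumPy sketch of $\mathcal{T}_{\tau}$ together with a single update of the form given above. The function names `soft_threshold` and `ista_step` are hypothetical, and the snippet deliberately leaves out the centering and the iteration loop; it is meant as an illustration of the formulas rather than as the expected solution.

```python
import numpy as np

def soft_threshold(beta, tau):
    """Component-wise soft-thresholding: (|beta_i| - tau)_+ * sign(beta_i)."""
    beta = np.asarray(beta, dtype=float)
    return np.sign(beta) * np.maximum(np.abs(beta) - tau, 0.0)

def ista_step(beta, X, t, lbda, eta):
    """One update: gradient step on ||X beta - t||_2^2, then thresholding at lbda*eta."""
    grad = 2.0 * X.T @ (X @ beta - t)
    return soft_threshold(beta - eta * grad, lbda * eta)

# Entries with magnitude below the threshold are set to zero,
# the others are shrunk towards zero by the threshold.
example = soft_threshold(np.array([-3.0, -0.5, 0.0, 0.2, 2.0]), 1.0)

# One step from beta = 0 on a toy problem with X equal to the identity
step = ista_step(np.zeros(2), np.eye(2), np.array([3.0, 0.1]), lbda=1.0, eta=0.5)
```

Setting small coefficients exactly to zero is what makes the LASSO solution sparse.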

Question 1.1. (5pts)

Complete the function 'ISTA' below, which must return a final estimate for the regression vector $\boldsymbol{\beta}$ given a feature matrix $\mathbf{X}$, a target vector $\mathbf{t}$, a regularization weight $\lambda$, and a learning rate $\eta$. The function should include the centering steps for $\mathbf{x}^{(i)}$ and $t^{(i)}$.

In [ ]:

import numpy as np

def ISTA(beta_init, X, t, lbda, eta):

    '''The function takes as input an initial guess for beta, a set
    of feature vectors stored in X and their corresponding
    targets stored in t, a regularization weight lbda,
    step size parameter eta and must return the
    regression vector following from the minimization of
    the LASSO objective'''

    # TODO: center X and t, then iterate the ISTA update starting from beta_init
    beta = beta_init  # placeholder so that the cell runs; replace with your solution

    return beta

Question 1.2. (2pts)

Apply your algorithm to the data (in red) given below for polynomial features up to degree 6 or 7 and for various values of $\lambda$. Display the result on top of the true model (in blue). Note that for $\beta_0$ to be identically zero in the model including the higher degree features, the centering should be done after generating those features.

In [6]:

import numpy as np
import matplotlib.pyplot as plt
from math import sqrt
from scipy import linalg

# noiseless linear model, sampled coarsely (data) and finely (true model)
x = np.linspace(0,1,10)
xtrue = np.linspace(0,1,100)
t_true = 0.1 + 1.3*xtrue

t = 0.1 + 1.3*x

# noisy observations of the targets
tnoisy  = t+np.random.normal(0,.1,len(x))

plt.scatter(x, tnoisy, c='r')
plt.plot(xtrue, t_true)
plt.show()
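One possible way to generate the polynomial features and to center them afterwards is sketched below. The helper name `centered_poly_features` is hypothetical, and the constant column is left out on purpose since the intercept is handled by the centering.

```python
import numpy as np

def centered_poly_features(x, degree):
    """Columns x, x^2, ..., x^degree, each centered to zero mean (hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    # Build the monomials first, without the constant column ...
    Phi = np.column_stack([x**d for d in range(1, degree + 1)])
    col_means = Phi.mean(axis=0)
    # ... and only then center, so that beta_0 can be taken to be zero.
    return Phi - col_means, col_means

x = np.linspace(0, 1, 10)
Phi, col_means = centered_poly_features(x, degree=6)
```

The column means are returned because the same shift (together with the mean of the targets) has to be applied when evaluating the fitted model on new points such as `xtrue`.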

Question 1.3 FISTA (5pts)

It is possible to improve the ISTA updates by combining them with a so-called 'Nesterov acceleration'. The resulting update, known as FISTA, can be written by letting $\mathbf{y}^{(1)} = {\boldsymbol \beta}^{(0)}$ (your first estimate for $\boldsymbol \beta$, e.g. a random vector), $\eta^{(1)} = 1$ and then using (for $k \geq 1$)

\begin{align} \left\{ \begin{array}{l} \boldsymbol{\beta}^{(k)} = \text{ISTA}(\mathbf{y}^{(k)})\\ \eta^{(k+1)} = \frac{1+\sqrt{1+4(\eta^{(k)})^2}}{2}\\ \mathbf{y}^{(k+1)} = \boldsymbol{\beta}^{(k)} + \left(\frac{\eta^{(k)} - 1}{\eta^{(k+1)}}\right)\left({\boldsymbol\beta}^{(k)} - {\boldsymbol\beta}^{(k-1)}\right)\end{array}\right. \end{align}

Here $\text{ISTA}$ denotes a single ISTA update (with its own fixed step size). Note that in this recursion $\eta^{(k)}$ is the momentum sequence of the acceleration, not the step size of the gradient step.
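To get a feeling for the acceleration, the short sketch below computes the first terms of the sequence $\eta^{(k)}$ and the extrapolation weights $(\eta^{(k)} - 1)/\eta^{(k+1)}$ appearing in the last line of the recursion. The function name `momentum_sequence` is hypothetical and the snippet does not perform any ISTA update.

```python
from math import sqrt

def momentum_sequence(num_iters):
    """Return eta^(1..num_iters+1) and the weights (eta^(k) - 1) / eta^(k+1)."""
    etas = [1.0]        # eta^(1) = 1
    weights = []
    for _ in range(num_iters):
        eta_next = (1.0 + sqrt(1.0 + 4.0 * etas[-1] ** 2)) / 2.0
        weights.append((etas[-1] - 1.0) / eta_next)
        etas.append(eta_next)
    return etas, weights

etas, weights = momentum_sequence(5)
```

The first weight is zero, so the first FISTA iteration coincides with a plain ISTA step; the weights then increase towards 1, which means that later iterations extrapolate more and more along the direction $\boldsymbol\beta^{(k)} - \boldsymbol\beta^{(k-1)}$.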

Complete the function below so that it performs the FISTA iterations. Then apply it to the data given in question 1.2.

In [ ]:

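def FISTA(X, t, eta1, beta0, lbda):

    '''The function should return the solution to the
    minimization of the LASSO objective
    ||X*beta - t||_2^2 + lambda*||beta||_1 by implementing the
    FISTA updates'''

    # TODO: implement the FISTA iterations starting from beta0
    final_beta = beta0  # placeholder so that the cell runs; replace with your solution

    return final_beta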

Question 1.4. (2pts)

Apply your implementation of FISTA to the data (in red) given below for polynomial features up to degree 6 or 7 and for various values of $\lambda$. Display the result on top of the true model (in blue). Note that for $\beta_0$ to be identically zero in the model including the higher degree features, the centering should be done after generating those features.

In [ ]:

import numpy as np
import matplotlib.pyplot as plt
from math import sqrt
from scipy import linalg

# noiseless linear model, sampled coarsely (data) and finely (true model)
x = np.linspace(0,1,10)
xtrue = np.linspace(0,1,100)
t_true = 0.1 + 1.3*xtrue

t = 0.1 + 1.3*x

# noisy observations of the targets
tnoisy  = t+np.random.normal(0,.1,len(x))

plt.scatter(x, tnoisy, c='r')
plt.plot(xtrue, t_true)
plt.show()

Question 1.5. (2pts)

Compare the ISTA and FISTA updates by plotting the evolution of the LASSO loss $\ell(\boldsymbol{\beta})$ as a function of the iterations for both approaches on a degree-5 model.
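The quantity to record at each iteration is the LASSO objective defined in Question 1. A minimal sketch of its evaluation is given below; the helper name `lasso_loss` and the toy arrays are hypothetical.

```python
import numpy as np

def lasso_loss(beta, X, t, lbda):
    """LASSO objective ||X beta - t||_2^2 + lbda * ||beta||_1."""
    residual = X @ beta - t
    return float(residual @ residual + lbda * np.sum(np.abs(beta)))

# Toy check with X equal to the identity
X = np.array([[1.0, 0.0], [0.0, 1.0]])
t = np.array([1.0, -2.0])
loss_at_zero = lasso_loss(np.zeros(2), X, t, lbda=0.5)   # only the data term is active
loss_at_t = lasso_loss(t, X, t, lbda=0.5)                # only the penalty is active
```

Appending the value of such a function to a list after every update gives the curves to plot for the two methods.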

In [ ]:

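import matplotlib.pyplot as plt


FISTA_loss = '''put your solution here'''
ISTA_loss = '''put your solution here'''

# plot the loss per iteration for both methods on the same axes
plt.plot(FISTA_loss, label='FISTA')
plt.plot(ISTA_loss, label='ISTA')
plt.xlabel('iteration')
plt.ylabel('LASSO loss')
plt.legend()
plt.show()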

Part II: Convolutional Neural Network and Autonomous Driving

In this second part, we will use the Keras API to build and train a convolutional neural network that discriminates between four types of road signs:

A '30 km/h' sign (folder 1)
A 'Stop' sign (folder 2)
A 'Go straight' sign (folder 3)
A 'Keep left' sign (folder 4)

An example of each sign is given below.

In [11]:

import matplotlib.pyplot as plt
import matplotlib.image as mpimg

img1 = mpimg.imread('1/00001_00000_00012.png')
plt.subplot(141)
plt.imshow(img1)
plt.axis('off')
plt.subplot(142)
img2 = mpimg.imread('2/00014_00001_00019.png')
plt.imshow(img2)
plt.axis('off')
plt.subplot(143)
img3 = mpimg.imread('3/00035_00008_00023.png')
plt.imshow(img3)
plt.axis('off')
plt.subplot(144)
img4 = mpimg.imread('4/00039_00000_00029.png')
plt.imshow(img4)
plt.axis('off')
plt.show()

Question 1 (10pts)

In the questions below, we will build and train a convolutional neural network to discriminate between the four types of signs.

Before building the network, you should start by cropping the images so that they all have a common predefined size (take the smallest size across all images).
We will use a Sequential model from Keras, but it will be up to you to define the structure of the convolutional net. The sequential model can be initialized with the following line

In [ ]:

from tensorflow.keras.models import Sequential
model = Sequential()
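The cropping step mentioned above could look as follows. This is a sketch under assumptions: the assignment does not say where to crop, so the snippet crops around the center, and it uses arrays of zeros as stand-ins for the images returned by `mpimg.imread`. The helper name `center_crop` is hypothetical.

```python
import numpy as np

def center_crop(img, height, width):
    """Crop an (H, W, C) image array around its center (hypothetical helper)."""
    top = (img.shape[0] - height) // 2
    left = (img.shape[1] - width) // 2
    return img[top:top + height, left:left + width]

# Stand-ins for images of different sizes, stored as (height, width, channels)
images = [np.zeros((30, 29, 3)), np.zeros((45, 52, 3)), np.zeros((33, 31, 3))]

# Smallest height and width across all images
min_h = min(im.shape[0] for im in images)
min_w = min(im.shape[1] for im in images)

# After cropping, the images can be stacked into a single 4D array
X = np.stack([center_crop(im, min_h, min_w) for im in images])
```

The resulting array has shape (number of images, height, width, channels), which is the layout expected by a convolutional layer using the 'channels_last' convention.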

1.a. Convolutions.

We will use a convolutional architecture. You can add convolutional layers to the model by using the following lines

In [ ]:

# input_shape is (height, width, channels) in the default 'channels_last' format
model.add(Conv2D(num_units, (filter_size1, filter_size2), padding='same',
                 input_shape=(IMG_SIZE, IMG_SIZE, 3),
                 activation='relu'))

for the first layer and

In [ ]:

model.add(Conv2D(filters, filter_size, activation=activation))

for all the others. Here 'filters' indicates the number of filters you want to use in the convolutional layer, filter_size is the size of each filter, and activation is the usual activation that comes on top of the convolution, i.e. $x_{\text{out}} = \sigma(\text{filter}*\text{input})$. Finally, input_shape indicates the size of your input. Note that only the input layer should be given the input size; subsequent layers will automatically compute the size of their inputs based on the previous layers.

1.b. Pooling Layers

On top of the convolutional layers, convolutional neural networks (CNN) also often rely on pooling layers. The addition of such a layer can be done through the following line

In [ ]:

model.add(MaxPooling2D(pool_size=(filter_sz1, filter_sz2), strides=None))

The pooling layers usually come with two parameters: the 'pool size' and the 'stride'. The basic choice for the pool size is (2,2), and the stride is usually set to None (in which case it defaults to the pool size, so that the image is split into non-overlapping regions, as in the figure below). You should however feel free to play a little with those parameters. The MaxPool operator considers a mask of size 'pool_size' which is slid over the image by a number of pixels equal to the stride parameter (in x and y; there are hence two translation parameters). For each position of the mask, the output only retains the max of the pixels appearing in the mask (this idea is illustrated below). One way to understand the effect of the pooling operator is the following: if the filter detects an edge in a subregion of the image (thus returning at least one large value), then, although max pooling reduces the size of the feature maps (and hence the number of parameters in the subsequent dense layers), it keeps track of this information.

Adding 'Maxpooling' layers is known to work well in practice.
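The effect of a (2,2) max pooling with non-overlapping regions can be reproduced by hand on a small array. The NumPy sketch below is only an illustration of the operation on a single channel, not the Keras implementation; the helper name `max_pool_2x2` is hypothetical and the input is assumed to have even height and width.

```python
import numpy as np

def max_pool_2x2(img):
    """Non-overlapping 2x2 max pooling of a 2D array with even height and width."""
    h, w = img.shape
    # Split the image into (h/2) x (w/2) blocks of size 2x2 ...
    blocks = img.reshape(h // 2, 2, w // 2, 2)
    # ... and keep the largest value of each block.
    return blocks.max(axis=(1, 3))

img = np.array([[1, 3, 2, 0],
                [4, 2, 1, 1],
                [0, 0, 5, 6],
                [1, 2, 7, 8]])
pooled = max_pool_2x2(img)
# pooled is [[4, 2],
#            [2, 8]]
```

The 4x4 input is reduced to a 2x2 output, and the large responses (here 4 and 8) survive the reduction.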

Although it is largely up to you to decide how you want to structure the network, a good start is to stack a few (definitely not more than 4) combinations (convolution, convolution, pooling) with an increasing number of units (you can use powers of two such as 16, 32, 64, 128, ...).

1.c. Flattening and Fully connected layers

Once you have stacked the convolutional and pooling layers, you should flatten the output through a line of the form

In [ ]:

model.add(Flatten())

Then add a couple of dense fully connected layers (no more than 2 or 3 are needed) through lines of the form

In [ ]:

model.add(Dense(num_units, activation='relu'))

1.d. Concluding

Since there are four possible signs, you need to finish your network with a dense layer with 4 units. Together, those units should output four numbers between 0 and 1, each representing the likelihood that the corresponding sign is detected, and such that $p_1 + p_2 + p_3 + p_4 = 1$ (hopefully with one probability much larger than the others). For this reason, a good choice for the final activation function of those four units is the softmax (Why?).
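As a reminder of what the softmax computes, here is a small NumPy sketch applied to four arbitrary scores. The function below is a hand-written illustration (with the usual shift by the maximum for numerical stability), not the Keras activation itself, and the input values are made up.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = np.asarray(z, dtype=float)
    shifted = z - z.max(axis=-1, keepdims=True)  # avoids overflow in exp
    e = np.exp(shifted)
    return e / e.sum(axis=-1, keepdims=True)

# Four raw scores, one per sign
p = softmax([2.0, 1.0, 0.1, -1.0])
```

The entries of `p` are positive, sum to one, and preserve the ordering of the scores, so the largest score receives the largest probability.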

Build your model below.

In [ ]:

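from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

model = Sequential()

# construct the model using convolutional layers, dense fully connected layers and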

Question 2 (3pts). Setting up the optimizer

Once you have found a good architecture for your network, split the dataset by retaining about 90% of the images of each folder for training and 10% for testing. To train your network in Keras, we need two more steps. The first step is to set up the optimizer. Here again, it is largely up to you to decide how you want to set up the optimization. Two popular choices are SGD and Adam. You get to choose the learning rate, which should however be between 1e-3 and 1e-2. Once you have set up the optimizer, we need to set up the optimization parameters. This includes the loss, which we will take to be the categorical cross entropy (the extension of the log loss to the multiclass problem).
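The categorical cross entropy compares one-hot encoded targets with the probabilities returned by the softmax. The NumPy sketch below illustrates both ingredients on made-up values; the helper names are hypothetical, and the labels are assumed to be integers from 0 to 3 (for instance the folder number minus one).

```python
import numpy as np

def one_hot(labels, num_classes):
    """Turn integer labels 0..num_classes-1 into one-hot rows."""
    out = np.zeros((len(labels), num_classes))
    out[np.arange(len(labels)), labels] = 1.0
    return out

def categorical_cross_entropy(t_onehot, probs, eps=1e-12):
    """Mean over the samples of -sum_c t_c * log(p_c)."""
    return float(-np.mean(np.sum(t_onehot * np.log(probs + eps), axis=1)))

labels = np.array([0, 3, 1])
T = one_hot(labels, 4)
# Made-up predicted probabilities, one row per sample
P = np.array([[0.70, 0.10, 0.10, 0.10],
              [0.25, 0.25, 0.25, 0.25],
              [0.10, 0.80, 0.05, 0.05]])
loss = categorical_cross_entropy(T, P)
```

Only the probability assigned to the correct class contributes to the loss of each sample, so confident correct predictions are rewarded with a small loss. The 'categorical_crossentropy' loss in Keras likewise expects one-hot encoded targets.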

In [ ]:

from tensorflow.keras.optimizers import SGD
from tensorflow.keras.optimizers import Adam

# set up the optimizer here, with a learning rate between 1e-3 and 1e-2
# Myoptimizer = SGD(learning_rate=...)
# Myoptimizer = Adam(learning_rate=...)

model.compile(loss='categorical_crossentropy',
              optimizer=Myoptimizer,
              metrics=['accuracy'])

Question 3 (2pts). Optimization

The last step is to fit the network to your data. Just as with any model in scikit-learn, we use a call to the function 'fit'. Neural networks can be trained by splitting the dataset into minibatches and using a different batch at each SGD step. This process is repeated until the whole dataset has been covered. One complete pass over the dataset is called an epoch, and training usually runs for several epochs. In Keras, the number of epochs is set through the 'epochs' parameter and the batch size through the 'batch_size' parameter.
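To relate the batch size to the number of SGD steps, the short computation below counts the minibatches in one epoch. It is a back-of-the-envelope sketch: the helper name and the dataset size of 1000 images are hypothetical, and the exact rounding used internally by Keras may differ slightly.

```python
from math import ceil

def steps_per_epoch(num_samples, batch_size, validation_split=0.0):
    """Approximate number of minibatch updates in one epoch on the training portion."""
    num_train = num_samples - int(num_samples * validation_split)
    return ceil(num_train / batch_size)

# 1000 images, 20% of them held out for validation, batches of 32 images
steps = steps_per_epoch(num_samples=1000, batch_size=32, validation_split=0.2)
```

With these numbers, 800 images are used for training, which gives 25 updates per epoch. Smaller batches mean more (but noisier) updates per epoch.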

In [ ]:

batch_size = '''set the size of the batch here'''
epochs = '''set number of epochs here'''

# X holds the training images and t their one-hot encoded labels
model.fit(X, t, batch_size=batch_size, epochs=epochs, validation_split=0.2)